Main result 1 (What
is the x value?)
As stated previously we need our “x” value which will predict our “y”
value. Our “y” value being what we want to find, Strength of the egg and
the “x” value being what will be the best predictor of the “y” value. As
shown in the four plots above in the plot data section. Our two best
options are Housing and Thickness for the “x” value. However, since
Housing is Nominal data and Thickness is Ratio data we will go ahead and
choose the Thickness variable to be our “x” value.
# THICKNESS predicting STRENGTH
g = ggplot(eggs, aes(x = THICKNESS, y = STRENGTH, color = "blue")) + geom_point()
g = g + geom_smooth(method = "loess", formula = y ~ x)
g

Main result 2 (Models
are not looking good)
Huge Problem though, the data size of 28 is not a lot of data :(.
Which means that this prediction is not going to be the most reliable
because trends could be found by randomness (probability 0.5, but 8 out
of 10 are true when it should be 5 for example). Another official term
for this is the null hypothesis and will be shown later in this
document. So because of this, using a deterministic model is out of the
question (calculate future event exactly, without involvement of
randomness). However a questionable probabilistic model is still
something which we can make from this data (uses the effect of random
occurrences or actions to forecast the possibility of future
results).
A Simple Linear Regression Model (SLR) is one type of probabilistic
model that assumes that the data goes in a straight line. If we had data
that looked like it goes in a straight line this would make life easy,
but sadly this is not true.
This is mathematical notation for SLR \[y=\beta_0+\beta_1x_i+\epsilon_i\] The main
thing which you need to know is that \(\epsilon\) is the random error & \(\beta_0+\beta_1x\) is the mean value of y
given x.
Then there is a Quadratic Linear Regression Model which is a type of
probabilistic model that assumes that the data goes in a quadratic type
curve. Again, this is not the case for our egg data :(.
This is mathematical notation for Quadratic Linear Regression Model
\[y=\beta_0+\beta_1x_i+\beta_2x^2+\epsilon_i\]
The main thing which you need to know is that \(\epsilon\) is the random error & \(\beta_0+\beta_1x_i+\beta_2x^2\) is the mean
value of y given x. Same as SLR, but just, you know, quadratic.
Checks on
validity
library(s20x)
eggs.lm=lm(STRENGTH~THICKNESS,data=eggs)
summary(eggs.lm)
##
## Call:
## lm(formula = STRENGTH ~ THICKNESS, data = eggs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0704 -1.1199 0.5895 2.2627 3.1171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.608 5.438 6.732 3.83e-07 ***
## THICKNESS 1.749 11.543 0.152 0.881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.772 on 26 degrees of freedom
## Multiple R-squared: 0.0008822, Adjusted R-squared: -0.03755
## F-statistic: 0.02296 on 1 and 26 DF, p-value: 0.8807
eggs.qdlm=lm(STRENGTH~THICKNESS + I(THICKNESS ^ 2),data=eggs)
summary(eggs.qdlm)
##
## Call:
## lm(formula = STRENGTH ~ THICKNESS + I(THICKNESS^2), data = eggs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9088 -1.0282 0.3397 2.2476 3.8097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.04 49.85 -0.241 0.811
## THICKNESS 216.84 219.42 0.988 0.333
## I(THICKNESS^2) -235.25 239.65 -0.982 0.336
##
## Residual standard error: 2.774 on 25 degrees of freedom
## Multiple R-squared: 0.03796, Adjusted R-squared: -0.039
## F-statistic: 0.4933 on 2 and 25 DF, p-value: 0.6164
P-value for the qdlm is smaller, therefore better, but still a long
shot away to invalidating the null hypothesis (random chance).
Essentially, both still suck :( and are so high in fact that the null
hypothesis cannot be rejected (chance of randomness).
plot(eggs.lm, which = 1)

plot(eggs.qdlm, which = 1)

Residuals (red line) are not close to the dotted line (closeness to
mean) which means the data is worse.
ciReg(eggs.lm)
## 95 % C.I.lower 95 % C.I.upper
## (Intercept) 25.43065 47.78625
## THICKNESS -21.97711 25.47497
ciReg(eggs.qdlm)
## 95 % C.I.lower 95 % C.I.upper
## (Intercept) -114.7114 90.6377
## THICKNESS -235.0575 668.7300
## I(THICKNESS^2) -728.8292 258.3232
This range is where 95% of the data shows up.
Straight trend line
for eggs data
layout(1)
plot(eggs$STRENGTH~eggs$THICKNESS, bg="Blue",pch=21, xlim = c(0, 1.1*max(eggs$THICKNESS)), xlab = "Thickness", ylim = c(0, 1.1*max(eggs$STRENGTH)), ylab = "Strength of Egg", cex = 1.2, main = "Plot of Thickness vs Strength", data=eggs)
abline(eggs.lm)

plot(eggs$STRENGTH~eggs$THICKNESS)
myplot = function(x){
eggs.qdlm$coef[1] + eggs.qdlm$coef[2] * x + eggs.qdlm$coef[3] * x ^ 2
}
curve(myplot, lwd = 2, col = "steelblue", add = TRUE)

Trendscatter of
eggs data
trendscatter(eggs$STRENGTH~eggs$THICKNESS, data = eggs, f = 0.5)

Residual vs
fitted values for eggs data
strength.res = residuals(eggs.lm)
strength.fit = fitted(eggs.lm)
strength.qdres = residuals(eggs.qdlm)
strength.qdfit = fitted(eggs.qdlm)
Trendscatter on
Residual Vs Fitted for eggs data
trendscatter(strength.res~strength.fit)

trendscatter(strength.qdres~strength.qdfit)

The trend must go back to the middle dotted line and stay their to be
reliable, this is not the case as you can see here.
Independence of
data
t.test(eggs$STRENGTH,eggs$THICKNESS)
##
## Welch Two Sample t-test
##
## data: eggs$STRENGTH and eggs$THICKNESS
## t = 71.859, df = 27.016, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 35.90434 38.01495
## sample estimates:
## mean of x mean of y
## 37.4285714 0.4689286
They come from the same egg so no, there is no independence.